SportsStats Olympic Athletes Analysis
This project analyzes over 120 years of Olympic Games data to uncover patterns in athletic performance, participation trends, and country achievements. Using Python and SQL, I explored athlete demographics, medal distributions, and how physical attributes vary across sports. The analysis shows how gender participation has evolved, which countries excel in different sports, and which physical characteristics correlate with success. These insights help explain the evolution of the Olympics and the factors that contribute to athletic excellence on the world stage.
Project Overview
- Project: Data Cleaning and Analysis
- Date: June 2025
- Category: Data Analysis
- Tools: Python, SQL, Pandas, Matplotlib
- Role: Data Analyst
Project Details
Goal & context. This project analyzes over 120 years of Olympic Games data (1896–2016) to uncover patterns in athlete demographics, medal performance, and participation. The work combines Python (pandas, matplotlib) with SQL (DuckDB) to handle a large dataset (271K+ rows) and deliver clear insights on countries, sports, gender, and physical attributes, suitable for both narrative reporting and presentation.
Data cleaning & preparation. Raw athlete and NOC/region data were loaded and cleaned in Python: missing values in critical columns (e.g., age, height, weight, NOC, sport) were handled, duplicates were removed, and data types were standardized (e.g., integer for age, height, and weight). NOC codes were joined to region names for country-level analysis. A clean, analysis-ready dataset was then exported (e.g., to CSV) for use in DuckDB and in visualizations.
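The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`Name`, `Age`, `Height`, `Weight`, `NOC`, `Sport`, `region`), the toy rows, and the missing-value policy (drop rows without an NOC or sport, keep missing numeric values as nulls) are all assumptions made for the example.

```python
import pandas as pd

# Invented toy rows standing in for the raw athlete table.
# Column names are assumptions; the real files may differ.
athletes = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bo", "Cy", "Di"],
    "Age": [24.0, 24.0, None, 31.0, 19.0],
    "Height": [180.0, 180.0, 175.0, None, 162.0],
    "Weight": [80.0, 80.0, 70.0, 90.0, 50.0],
    "NOC": ["USA", "USA", "JAM", "GER", None],
    "Sport": ["Basketball", "Basketball", "Athletics", "Rowing", "Gymnastics"],
})
regions = pd.DataFrame({
    "NOC": ["USA", "JAM", "GER"],
    "region": ["USA", "Jamaica", "Germany"],
})


def clean_athletes(athletes, regions):
    """Deduplicate, drop rows missing key fields, fix dtypes, add region."""
    numeric_cols = ["Age", "Height", "Weight"]
    df = athletes.drop_duplicates()
    # Assumed policy: a row without an NOC or sport cannot be analyzed.
    df = df.dropna(subset=["NOC", "Sport"])
    # Nullable Int64 stores integers while keeping missing values as <NA>.
    df = df.assign(**{c: df[c].round().astype("Int64") for c in numeric_cols})
    # Left join keeps every athlete row even if an NOC has no region entry.
    df = df.merge(regions, on="NOC", how="left")
    return df.reset_index(drop=True)


cleaned = clean_athletes(athletes, regions)
print(cleaned)
```

From here, `cleaned.to_csv(...)` would produce the kind of analysis-ready export mentioned above. Keeping missing ages and heights as nulls, rather than imputing them, is one possible choice; it leaves averages unbiased by filled-in values but reduces the sample for physical-attribute queries.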
Analysis with SQL (DuckDB). DuckDB was used to run efficient SQL queries on the cleaned dataset: medal counts by country, participation by year and sex, physical profiles (e.g., average height/weight/age by sport), and medal efficiency (medals per athlete by country). Aggregations and filters were designed to answer specific questions (top countries, gender trends over time, sport-specific demographics) and to feed both narrative insights and charts.
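Two of the aggregations named above, medal counts by country and participation by year and sex, might look like the queries below. To keep the sketch runnable with only the standard library, it uses Python's built-in `sqlite3` as a stand-in for DuckDB; the SQL itself is plain standard SQL. The table name, column names, and rows are invented for illustration.

```python
import sqlite3

# Invented toy rows: (athlete_id, sex, year, sport, region, medal).
rows = [
    (1, "M", 1996, "Athletics", "USA", "Gold"),
    (2, "F", 1996, "Athletics", "USA", None),
    (3, "F", 2016, "Athletics", "Jamaica", "Gold"),
    (3, "F", 2016, "Athletics", "Jamaica", "Silver"),
    (4, "M", 2016, "Rowing", "Germany", "Bronze"),
    (5, "F", 2016, "Rowing", "Germany", None),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE athletes ("
    "athlete_id INTEGER, sex TEXT, year INTEGER, "
    "sport TEXT, region TEXT, medal TEXT)"
)
con.executemany("INSERT INTO athletes VALUES (?, ?, ?, ?, ?, ?)", rows)

# COUNT(medal) skips NULLs, so only rows with a medal are counted.
MEDALS_BY_COUNTRY = """
SELECT region, COUNT(medal) AS medals
FROM athletes
GROUP BY region
ORDER BY medals DESC, region
"""

# One athlete can appear in several events, so count distinct IDs.
PARTICIPATION_BY_YEAR_SEX = """
SELECT year, sex, COUNT(DISTINCT athlete_id) AS athletes
FROM athletes
GROUP BY year, sex
ORDER BY year, sex
"""

medals = con.execute(MEDALS_BY_COUNTRY).fetchall()
participation = con.execute(PARTICIPATION_BY_YEAR_SEX).fetchall()
print(medals)
print(participation)
```

Counting distinct athlete IDs rather than rows matters for participation figures, since a single athlete entering several events would otherwise be counted more than once.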
Visualization & storytelling. Key results were visualized with matplotlib (e.g., bar charts for medal rankings, line charts for participation over time). Charts were labeled and styled for clarity. The analysis was documented in a structured notebook and summarized in a presentation so that findings on Olympic history, gender progress, and country performance could be communicated clearly to a non-technical audience.
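As a rough illustration of the line charts described above, the sketch below plots participation over time with matplotlib. The function name `plot_participation` is hypothetical, and the numbers are placeholders chosen only to make the example run; they are not real Olympic counts.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, works without a display
import matplotlib.pyplot as plt


def plot_participation(years, men, women):
    """Draw one labeled line per sex and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(years, men, marker="o", label="Men")
    ax.plot(years, women, marker="o", label="Women")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of athletes")
    ax.set_title("Olympic participation by sex")
    ax.legend()
    fig.tight_layout()
    return fig, ax


# Placeholder values for illustration only, not real participation data.
years = [1980, 1984, 1988, 1992]
men = [100, 110, 120, 125]
women = [20, 30, 45, 60]

fig, ax = plot_participation(years, men, women)
```

Calling `fig.savefig("participation.png")` afterwards would write the chart to disk for use in a notebook or slide deck; the file name is arbitrary.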
Outcome. The project delivers a reproducible Python + SQL workflow, a set of validated insights on Olympic athletes and nations, and presentation-ready visuals that demonstrate the ability to clean, model, query, and visualize a large historical dataset end to end.
Analysis Overview
- Data Cleaning & Preparation
Processed raw Olympic athlete data by handling missing values, removing duplicates, converting data types, and merging with regional information for comprehensive analysis.
- Medal Distribution Analysis
Analyzed medal counts across countries, identifying USA, Russia, and Germany as the top three medal-winning nations with 4,383, 3,610, and 3,189 total medals respectively.
- Gender Participation Trends
Tracked the evolution of male and female athlete participation over 120+ years, revealing significant growth in women's participation especially after the 1980s.
- Physical Attribute Analysis
Examined height, weight, and age distributions across different sports, finding that Basketball athletes average 191 cm height while Rhythmic Gymnasts have the youngest average age of 18.9 years.
- Medal Efficiency Metrics
Calculated medals per athlete ratios to identify efficient sporting nations, revealing that smaller countries like Jamaica achieve exceptional efficiency with 1.9 medals per athlete.
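The medal efficiency metric above could be computed along the following lines. This sketch assumes efficiency means total medals divided by the number of distinct athletes per region; the exact definition used in the project is not stated, and the column names (`ID`, `region`, `Medal`), the `min_athletes` parameter, and the toy rows are illustrative assumptions.

```python
import pandas as pd


def medal_efficiency(df, min_athletes=1):
    """Medals per distinct athlete for each region, highest first."""
    grouped = df.groupby("region").agg(
        medals=("Medal", "count"),     # count() ignores missing medals
        athletes=("ID", "nunique"),    # each athlete counted once
    )
    # Optional floor so very small delegations do not dominate the ranking.
    grouped = grouped[grouped["athletes"] >= min_athletes]
    grouped = grouped.assign(
        medals_per_athlete=grouped["medals"] / grouped["athletes"]
    )
    return grouped.sort_values("medals_per_athlete", ascending=False)


# Invented toy rows, one per athlete-event entry.
entries = pd.DataFrame({
    "ID": [1, 1, 2, 3, 4],
    "region": ["Jamaica", "Jamaica", "Jamaica", "USA", "USA"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", None],
})

efficiency = medal_efficiency(entries)
print(efficiency)
```

In this toy data Jamaica has 3 medals from 2 athletes (1.5 per athlete) and USA has 1 medal from 2 athletes (0.5). A ratio above 1, like the 1.9 reported for Jamaica, is possible under this definition because one athlete can win medals in several events or Games.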